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CN I Abstract 

^T*] Several Krylov-type procedures are introduced that generalize matrix Krylov methods 

for tensor computations. They are denoted minimal Krylov recursion, maximal Krylov 
recursion, contracted tensor product Krylov recursion. It is proved that the for a given 
vQ ■ tensor with low rank, the minimal Krylov recursion extracts the correct subspaces asso- 

ciated to the tensor within certain number of iterations. An optimized minimal Krylov 
procedure is described that gives a better tensor approximation for a given multilinear 
rank than the standard minimal recursion. The maximal Krylov recursion naturally ad- 
mits a Krylov factorization of the tensor. The tensor Krylov methods are intended for the 

(-H I computation of low-rank approximations of large and sparse tensors, but they are also 

useful for certain dense and structured tensors for computing their higher order singular 
value decompositions or obtaining starting points for the best low-rank computations of 
tensors. A set of numerical experiments, using real life and synthetic data sets, illustrate 
some of the properties of the tensor Krylov methods. 
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^P , 1. Introduction 

O 



Large-scale problems in engineering and science often require solution of sparse linear 
algebra problems, such as systems of equations, and eigenvalue problems. Recently 
[l|, y, la, Il8l420| it has been shown that several applications in information sciences, such 
. , as web link analysis, social networks, and cross-language information retrieval, generate 

j^ ' large data sets that are sparse tensors. In this paper we introduce new methods for 

efficient computations with large and sparse tensors. 

Since the 1950's Krylov subspace methods have been developed so that they are now 
one of the main classes of algorithms for solving iteratively large and sparse matrix prob- 
lems. Given a square matrix A £ R"'*" and a starting vector u £ R" the corresponding 
/c-dimensional Krylov subspace is 

JCk{A, u) = span{u, An, A^u, . . . , A'^~^u}. 
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In floating point arithmetic the vectors in the Krylov subspace are useless unless they 
arc orthonormalized. Applying Gram-Schmidt orthogonalization one obtains the Arnoldi 
process, which generates an orthonormal basis for the Krylov subspace ICk{A,u). In 
addition, the Arnoldi process generates the factorization 

AUk - Uk+iHk. (1) 

where Uk = [ui . . . Ufc], and Uk+i = [Uk u^+i] with orthonormal columns, and Hk is a 
Hessenberg matrix with the orthonormalization coefficients. Based on the factorization 
(H)) one can compute an approximation of the solution of a linear system or an eigenvalue 
problem by projecting onto the space spanned by the columns of Uk, where k is much 
smaller than the dimension of A; on that subspace the operator A is represented by the 
small matrix Hk- This approach is particularly useful for large, and sparse problems, 
since it uses the matrix A in matrix- vector multiplications only. 

Projection to a low-dimensional subspace is a common technique in many areas of 
information science. This is also the case in applications involving tensors. One of the 
main theoretical and algorithmic problems researchers have addressed is the computation 
of low rank approximation of a given tensor [9|, [ij, [IGI |23|, [26 1 . The two main approaches 



are the Canonical Decomposition [5l |l2| and the Tucker decomposition [30]; we are 
concerned with the latter. 

The following question arises naturally: 

Can Krylov methods be generalized to tensors, to be used for the projection 
to low- dimensional subspaces? 

We answer this question in the affirmative, and describe several alternative ways one 
can generalize Krylov subspace methods for tensors. Our method is inspired by Golub- 



Kahan bidiagonalization [10|, and the Arnoldi method, see e.g. [27|, p. 303]. In the 
bidiagonalization method two sequences of orthogonal vectors are generated; for a tensor 
of order three, our procedures generates three sequences of orthogonal vectors. Unlike 
the bidiagonalization procedure, it is necessary to perform Arnoldi style orthogonaliza- 
tion of the generated vectors explicitly. For matrices, once an initial vector has been 
selected, the whole sequence is determined uniquely. For tensors, there are many ways 
in which the vectors can be generated. We will describe three principally different tensor 
Krylov methods. These are the minimal Krylov recursion, maximal Krylov recursion 
and contracted tensor product Krylov recursion. In addition we will discuss the imple- 
mentation of an optimized version |11| of the minimal Krylov recursion, and we will 
show how to deal with tensors that are small in one mode. For a given tensor A with 
rank(^) = (p, q, r) the minimal Krylov recursion can extract the correct subspaces as- 
sociated to A in max{p, g, r} iterations. The maximal Krylov recursion admits a tensor 
Krylov factorization that generalizes the matrix Krylov factorization. The contracted 
tensor product Krylov recursion is a generalization of the matrix Lanczos method ap- 
plied to symmetric matrices A^ A and AA^ . 

Although our main motivation is to develop efficient methods for large and sparse 
tensors, the methods are useful for other tasks as well. In particular, they can be used 
for obtaining starting points for the best low rank tensor approximation problem, and 
for tensors with relatively low multilinear rank they provide a way of speeding up the 
computation of the Higher Order SVD (HOSVD) [7.]. The latter part is done by first 
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computing a full factorization using the minimal Krylov procedure and then computing 
the HOSVD of the much smaller core tensor that results from the approximation. 

The paper is organized as follows. The necessary tensor concepts are introduced 
in Section [51 The Arnoldi and Golub-Kahan procedures are sketched in Section [H In 
SectionSlwe describe different variants of Krylov methods for tensors. Section [S] contains 
numerical examples illustrating various aspects of the proposed methods. 

As this paper is a first introduction to Krylov methods for tensors, we do not imply 
that it gives a comprehensive treatment of the subject. Rather our aim is to outline our 
discoveries so far, and point to the similarities and differences between the tensor and 
matrix cases. 

2. Tensor Concepts 

2.1. Notation and Preliminaries 

Tensors will be denoted by calligraphic letters, e.g A, B, matrices by capital ronian 
letters and vectors by lower case roman letters. In order not to burden the presentation 
with too much detail, we sometimes will not explicitly mention the dimensions of matrices 
and tensors, and assume that they are such that the operations are well-defined. The 
whole presentation will be in terms of tensors of order three, or equivalently 3-tensors. 
The generalization to order- A^ tensors is obvious. 

We will use the term tensor in a restricted sense, i.e. as a 3-dimensional array of real 
numbers, A G ]ij'x™x"^ where the vector space is equipped with some algebraic structures 
to be defined. The different "dimensions" of the tensor are referred to as modes. We will 
use both standard subscripts and "MATLAB-like" notation: a particular tensor element 
will be denoted in two equivalent ways, 

A{i,j,k) = Oijk. 

A subtensor obtained by fixing one of the indices is called a slice, e.g., 

A{t,:,:). 

Such a slice can be considered as an order-3 tensor, but also as a matrix. 
A fibre is a subtensor, where all indices but one are fixed, 

Aii,:,k). 

For a given third order tensor, there are three associated subspaces, one for each mode. 
These subspaces are given by 

Range{^(:, j, k) \ j — 1 : m, k — 1 : n}, 
Range{^(i, :, fc) | i = 1 : Z, fc = 1 : n}, 
Range{^(i, j, :) \ i = 1 : I, j = 1 : m}. 

The multilinear rank [8|, llSl ] of the tensor is said to be equal to (p, q, r) if the dimension 
of these subspaces are p, q, and r, respectively. 

It is customary in numerical linear algebra to write out column vectors with the ele- 
ments arranged vertically, and row vectors with the elements horizontally. This becomes 
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inconvenient when we are dealing with more than two modes. Therefore we will not 
make a notational distinction between mode-1, mode-2, and mode-3 vectors, and we will 
allow ourselves to write all vectors organized vertically. It will be clear from the context 
to which mode the vectors belong. However, when dealing with matrices, we will often 
talk of them as consisting of column vectors. 

2.2. Tensor-Matrix Multiplication 

We define multilinear multiplication of a tensor by a matrix as follows. For concrete- 
ness we first present multiplication by one matrix along the first mode and later for all 
three modes simultaneously. The mode-1 product of a tensor A G ]r'x™x" by a matrix 
[/ e IRP^' is definecfl 

^pxmx« 3 g ^ (^)^ . ^^ ^^^.^ ^ ^ u,^a^,k- (2) 

This means that all mode-1 fibres in the 3-tensor A are multiplied by the matrix U. 
Similarly, mode-2 multiplication by a matrix V G M'^™ means that all mode-2 fibres are 
multiplied by the matrix V . Mode-3 multiplication is analogous. With a third matrix 
W G R''^", the tensor-matrix multiplicatioqj in all modes is given by 

RP^'i^- 3 B ^ iU,V,W) ■ A, h,k= Y. ^^o.v,0Wkya^p^, (3) 

a,/3,7— 1 

where the mode of each multiplication is understood from the order in which the matrices 
are given. 

It is convenient to introduce a separate notation for multiplication by a transposed 
matrix U gR'^'p-. 

I 

a=l 

Let u e M' be a vector and A £ m'x™><" a tensor. Then 

jglxmxn ^j^._ ^^T^^ -A^A- (u) ^ =Be K"^". (5) 

Thus we identify a tensor with a singleton dimension with a matrix. Similarly, with 
u G R' and w G R", we will identify 

Ri^"^i 9C:=y^-(u,u;)i3 = cGR", (6) 

i.e., a tensor of order three with two singleton dimensions is identified with a vector, here 
in the second mode. Since formulas like ^ have key importance in this paper, we will 
state the other two versions as well; 

^•(u,w)i_2 eR", A-iv,w)^;^eR', (7) 

where v G R'" . 



^The notation l(2}-l(3ll was suggested by Lim §]. An alternative notation was earlier given in 01 ■ Our 
(X)^ • J\. is the same as yt x^ X in that system. 

■^To clarify the presentation, when dealing with a general third order tensor A. , we will use the 
convention that matrices or vectors U,Uk,Ui, V,Vk,Vi and W, Wk , Wi are exclusively multiplied along the 
first, second, and third mode of ,4, respectively, and similarly with matrices and vectors X, Y, Z, x, y, z. 
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2.3. Inner Product, Norm, and Contractions 

Given two tensors A and B of the same dimensions, we define the inner product, 

{A,B) = ^ Oap^bap^. (8) 

a,/3,7 

The corresponding tensor norm is 

\\A\\ = {A,AY'^. (9) 

This Frobenius norm will be used throughout the paper. As in the matrix case, the norm 
is invariant under orthogonal transformations, i.e. 

\\A\\^\\{U,V,W)-A\\ = \\A-{P,Q,S)\\, 

for orthogonal matrices U, V, W, P, Q, and S. This is obvious from the fact that 
multilinear multiplication by orthogonal matrices does not change the Euclidean length 
of the corresponding fibres of the tensor. 

For convenience we will denote the inner product of vectors x and y in any mode (but, 
of course, the same) by x^y. Let v = A- {u, w)-^^ ^; then, for a matrix y = [ui W2 ■ • ■ Vp] 
of mode- 2 vectors, we have 

V^v= {V^)^-iA-iu,w),^^)^A-{u,V,w) eRi^P'^i. 

The following well-known result will be needed. 

Lemma 1. Let A G ]R'^™^" be given along with three matrices with orthonormal 
columns, U G K'^p. V G M™^'?, and W G M"^'",. where p < I, q < m, and r < n. 
Then the least squares problem 

m:m\\A-{U,V,W)-S\\ 
s 

has the unique solution 

S = {U^, V^, W^) •A = A-{U,V,W). 

The elements of the tensor S are given by 

sx^j.u^ A- {u\,v^_,,w^) , l<A<p, l<^J.<q, l<i'<r. (10) 

Proof. The proof is a straightforward generalization of the corresponding proof for ma- 
trices. Enlarge each of the matrices so that it becomes square and orthogonal, i.e., put 

U=[UUa_], V=[VVa}, W=[WWa}. 

Introducing the residual TZ = A — {U, V, W) • S and using the invariance of the norm 
under orthogonal transformations, we get 

||7^f = \\n ■ [u, V, w) f = \\A ■ {u, v, w) - sf + c\ 

where C^ = ||^ • {U±,V±,W±) |p does not depend on S. The relation (ITUl) is obvious 
from the definition of tensor- matrix product. D 
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The inner produ ct (pi ) can be considered as a special case of the contracted product 
of two tensors, cf. [ITI . Chapter 2], which is a tensor (outer) product foUowed by a 
contraction along specified modes. Thus, if A and B are 3-tensors, we define, using 
essentially the notation of [3|, 

C = {A,B)-^ , Cjkj'k' = '^aajkbaj'k' , (4-tensor) , (ff.a) 

D = {A,B)-^^^^ , dkk' ^'^aapkbapk' , (2-tensor), (H-b) 

a,li 

e = {A,B) = {A,B)-^ 3 , e~ ^ aap-^bap-y , (scalar). (H-c) 

It is required that the contracted dimensions are equal in the two tensors. We will refer 
to the first two as partial contractions. The subscript I in {A,B)i and 1,2 in {A,B)i^2 
indicate that the contraction is over the first mode in both arguments and in the first 
and second mode in both arguments, respectively. It is also convenient to introduce a 
notation when contraction is performed in all but one mode. For example the product 
in (jll.bp may also be written 

{AB),^^^{AB)_^. (12) 

The definition of contracted products is valid also when the tensors are of different order. 
The only assumption is that the dimension of the correspondingly contracted modes are 
the same in the two arguments. The dimensions of the resulting product are in the order 
given by the non-contracted modes of the first argument followed by the non-contracted 
modes of the second argument. 

2.4- Tensor Matricization 

Later on we will also need the notion of tensor matricization. Any given third order 
tensor A € R'^™^" can be matricized alone its different modes. These matricizations 
will be written as A^^\ which is an Z x mn matrix, A^^', which is an ttt, x In matrix, and 
A^^"* is an n X Im matrix. The exact relations of the entries of A to the three different 
matricizations can be found in [9]. It is sufficient, for our needs in this paper, to recall 
that the matricizations of a given nuiltilinear tensor-matrix product B — (C/, V, W) • A 
have the following forms: 

5(1) = UA^^'>{V(x)Wf, 
S(2) = VA'-^'^{U(x)Wf, 
B^^^ = WA^^\U ^Vf. 

3. Two Krylov Methods for Matrices 

In this section wc will describe briefly the two matrix Krylov methods that are the 
starting point of our generalization to tensor Krylov methods. 



3.1. The Arnoldi Procedure 

The Arnoldi procedure is used to compute a low-rank approximation/factorization 
dJ) of a square, in general nonsymmetric matrix A. It requires a starting vector m —: Ui 
(or, alternatively, vi =: Vi), and in each step the new vector is orthogonalized against 
all previous vectors using the modified Gram-Schmidt process. We present the Arnoldi 
procedure in the style of [23, P- 303]. 

Algorithm 1 Arnoldi Procedure 
for z = 1, 2, . . . , fc do 

1 hi = Uj Aui 

2 hi+i^iUi+i = Aui - Uihi 

3 t/i+i = [Ui Ui+i] 



4H,= 


Hi-i 



h^ ' 

hi+i.i 


end for 







The constant /ii+i,i is used to normalize the new vector to length one. Note that the 
matrix Hk in the the factorization ([1]) is obtained by collecting the orthonormalization 
coefficients hi and ft-i+i.i in an upper Hessenberg matrix. 

3.2. Golub-Kahan Bidiagonalization 

Let A € R™>^" be a matrix, and let (3iUi,vo = 0, where ||ui|| = 1, be starting vectors. 
The Golub-Kahan bidiagonalization procedure [lO| is defined by the following recursion. 



Algorithm 2 Golub-Kahan bidiagonalization 



for i ~ 1,2, ... ,k do 

1 aiVi = A'^Ui - PiVi^i 

2 /3i+iUt+i = Avt - aiUi 
end for 



The scalars a^, /3i are chosen to normalize the generated vectors Vi,Ui. Forming the 
matrices Uk+i = [ui • • • Uk+i] G K^^^C'+i) and Vk = [I'l ■ ■ ■ Vk] & W^'^^ , it is straightfor- 
ward to show that 

AVk = Uk+iBk+i, A^Uk^VkBk, (13) 

where V^Vk = /, Uj^-JJk+i = I, and 

/32 062 



Bk+1 — 

















\ ^'' 1 


h 


Ctk 




Pk+iel. 




Pk+l_ 





B(fe + l)xfc 



(14) 



is bidiagonaO. 



^Note that the two sequences of vectors become orthogonal automatically; this is due to the fact 
that the bidiagonalization procedure is equivalent to the Lanczos process applied to the two symmetric 
matrices AA^ and A^ A. 

7 



Using tensor notation from Section 12.21 and a special case of the identification ([6]) , 
we may express the two steps of the recursion as 

Algorithm 3 Golub-Kahan bidiagonalization in tensor notation 
for i — 1,2, ... ,k do 

1 aiVi = A • (mj)j - (3iVi^i 

2 ^i+iUj+i = A ■ (vi)^ - aiUi 
end for 

We observe that the u,; vectors "live" in the first mode of A, and we generate the 
sequence U2,U3, . . ., by multiplication of the Vi vectors in the second mode, and vice 
versa. 



4. Tensor Krylov Methods 

4.1. A Minimal Krylov Recursion 

In this subsection we will present the main process for the tensor Krylov methods. 
We will further prove that, for tensors with rank(^) = {p, q, r), we can capture all three 
subspaces associated to A within max{p, q, r} steps of the algorithms. Finally we will 
state a partial factorization that is induced by the procedure. 

Let A G R''^™^" be a given tensor of order three. It is now straightforward to 
generalize the Golub-Kahan procedure, starting from Algorithm |31 Assuming we have 
two starting vectors, ui G MJ and vi G M™ we can obtain a third mode vector wi — 
A • (ui, wi)-^ 2 G I^"- We can then generate three sequences of vectors 

Wj+l = A- (ui,Wi)2,3 > (15) 

Wj+i = A- (ui,ii;j)i 3, (16) 

Wi+i = A- {ui,Vi)^^, (17) 

for i — 1, . . . ,fc. We see that the first mode sequence of vectors (ui+i) are generated 
by multiplication of second and third mode vectors {vi) and (wi) by the tensor A, and 
similarly for the other two sequences. The newly generated vector is immediately or- 
thogonalized against all the previous ones in its mode, using the modified Gram-Schmidt 
process. An obvious alternative to P^ and (|T7)) that is consistent with the Golub- 
Kahan recursion is to use the most recent vectors in computing the new one. This 
recursion is presented in Algorithm |4l In the algorithm description it is understood that 
Ui = [uiU2 ■■• Ui], etc. The coefficients au, a.^, and a^ are used to normalize the 
generated vectors to length one. 

For reasons that will become clear later, we will refer to this recursion as a minimal 
Krylov recursion. 

The process may break down, i.e. we obtain a new vector u^+i, for instance, which 
is linear combination of the vectors in Ui. This can happen in two principally different 
situations. The first one is when, for example, the vectors in Ui span the range space of 
^4*-"'^^. If this is the case we are done generating new m- vectors. The second case is when 



Algorithm 4 Minimal Krylov recursion 



Given: two normalized starting vectors ui and vi, 
a^wi = A- (wi,wi)i 2 
for i = l,2,...,fc-i do 

u = A- {vi,Wi)^ 3 ; /i„ = Uju 



auU^+i=u-U.,K\ Hf = 


'Hf K 

a„ 




V = A- {u.i+i,Wi)^r^\ hy^V.^v 


avVi+i =v - ViK; HI' = 


a^ 




id = A- {u,+i,v,+i)^.^; hyj^W^w 


a^Wj+i = w - Wihw] Hf = 


a^ 


end for 









we get a "true breakdown'^, Uj+i is a linear combination of vectors in C/i, but [/; does 
not span the entire range space of A^^'. This can be fixed by taking a vector Ui+i _L Ui 
with u,+i in range of A^^'. 



4-.1.1. Tensors with Given Cubical Ranks 

Assume that the I x m x n tensor has a cubical low rank, i.e. rank(^) = (r, r, r) with 
r < min{Z,TO,n}. Then there exist a tensor C £ r»'X''x»'^ and full column rank matrices 
X, Y, Z such that A = (A, Y, Z) ■ C. 

We will now prove that, when the starting vectors ui, vi and wi are in the range of 
the respective subspaces, the minimal Krylov procedure generates matrices U, V, W, such 
that span(t/) = span(A), span(y) = span(y) and span(W^) = span(Z) after r steps. Of 
course, for the low multilinear rank approximation problem of tensors it is the subspaces 
that are important, not their actual representation. The specific basis spanning e.g. 
span(A) is ambiguous. 

Theorem 2. Let A ^ {X,Y,Z) -C e M'^^x" with rank(^) = {r,r,r). Assume we 
have starting vectors in the associated range spaces, i.e. ui G span(A), vi S span(y), 
wi € span(Z). Assume also that the process does not break dowi^ within r iterations. 
Then the minimal Krylov procedure in Algorithmic generates matrices Ur,Vr, Wr with 

span(C/r) = span(A), span{Vr) — span(y), span(M^r) = span(Z). 

Proof. First observe that the recursion generates vectors in the span of A, Y, and Z, 



*In the matrix case a breakdown occurs in the Krylov recursion for instance if the matrix and the 
starting vector have the structure 



Ai 
A2 



An analogous situation can occur with tensors. 

^The newly generated vector is not a linear combination of previously generated vectors. 



respectively: 

A ■ {v, u))2 3 ^ C ■ (X"^, Y^v, Z~^w) ^ C • {X^, V, w) = Xci, 
A ■ (u, w)^ 3 = C ■ {X~^u, Y~^, Z'^w) ^C-{u, y"^, w) = Yc2, 
A • (u, u)^ 2 = C ■ {X~^u, Y~^v, Z~^) =C- (u, V, Z~^) = Zc3, 

where in the first equation v — Y~^v, w — Z~^w and ci = C • {v^'w)^ 37 and the other 
two equations are analogous. Consider the first mode vector u. Clearly it is a linear 
combination of the column vectors in X. Since we orthonormalize every newly generated 
M-vector against all the previous vectors, and since we assume that the process does 
not break down, it follows that dim(span([ui • • • Ufc])) = k for k < r will increase by 
one with every new m- vector. Given that ui G span(X) then for k = r we have that 
span([wi • • • Ur\) = span(X). The proof is analogous for the second and third modes. D 

We would like to make e few remarks on this theorem: 

Remark (1). It is straightforward to show that when the starting vectors are not in the 
associated range spaces we would only need to do one more iteration, i.e. in total r + 1 
iterations, to obtain matrices LV+i, K-+1 and Wr+i that would span the column spaces 
oi X, Y and Z, respectively. 

Remark (2). It is easy to obtain starting vectors ui £ span(X), vi G span(F) and 
wi G span(Z). Choose any single nonzero mode- A: vector or the mean of the mode-fc 
vectors. 

Remark (3). Even if we do not choose starting vectors in the range spaces of X, Y, Z and 
run the minimal Krylov procedure r + 1 steps we can easily obtain a matrix Ur spanning 
the correct subspaces. To do this just observe that t/^+i^^^'' = Uj+iXC^'^^{Y ® ZY is 
an (r + 1) x mn matrix with rank r. 

4-. 1.2. Tensors with General Low Multilinear Rank 

Next we discuss the case when the tensor A E ]r'x™x" has rank(^) = {p,q,r) with 
p < I, q < ni, and r < n. Without loss of generality we can assume p < q < r. 
Then A — (X, Y,Z) • C where C \s a, p x q x r tensor and X, Y, Z are full column rank 
matrices with conformal dimensions. The discussion assumes exact arithmetic and that 
no breakdown occurs. 

From the proof of Theorem [5] we see that the vectors generated are in the span of X, 
y, and Z , respectively. Therefore, after having performed p steps we will not be able 
to generate any new vector in the first mode. This can be detected from the fact that 
the result of the orthogonalization is zero. We can now continue generating vectors in 
the second and third modes, using any of the first mode vectors, or a (possibly random) 
linear combination of thenij. This can be repeated until we have generated q vectors 
in the second and third modes. The final r — q mode-3 vectors can then be generated 
using combinations of mode-1 and mode-2 vectors that have not been used before, or. 



^Also the optimization approach of Section 14.31 can be used. 
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alternatively, random linear combinations of previously generated mode-1 and mode-2 
vectors. We refer to the procedure described in this paragraph as the modified minimal 
Krylov recursion. 

Theorem 3. Let A G M'^™^" be a tensor o/ rank(^) = {p,q,r) with p < q < r. We 
can then write A = {X, Y, Z) • C, where Cisapxqxr tensor and X, Y, Z are full 
column rank matrices with conforming dimensions. Assume that the starting vectors 
satisfy ui G span(X), vi £ span (y) and wi G span(Z). Assume also that the process 
does not break down except when we obtain a set of vectors spanning the full range 
spaces of the different modes. Then in exact arithmetic, and in a total of r steps the 
modified minimal Krylov recursion produces matrices Up, Vq and Wr, which span the 
same suhspaces as X, Y , and Z , respectively. 

The actual numerical implementation of the procedure in floating point arithmetic 
is, of course, much more complicated. For instance, the ranks will never be exact, so 
one must devise a criterion for determining the numerical ranks that will depend on the 
choice of tolerances. This will be the topic of our future research. 

4-1.3. Partial Factorization 

To our knowledge there is no simple way of writing the minimal Krylov recursion 
directly as a tensor Krylov factorization, analogous to ((T3)) . However, having generated 
three orthonormal matrices Uk, Vfc, and Wk, we can easily compute a low-rank tensor 
approximation of A using Lemma [I] 

A^{Uk,Vk,Wk)-H, H = {Uj,v;^,W^)-AeM.'''"''"'. (18) 

Obviously, Hxf^y ~ A- {u\,Vfj^,w^) . Comparing with Algorithm 2] we see that T-i contains 
elements from the Hessenberg matrices 7J", H'",H'^ , which contain the orthogonalization 
and normalization coefficients. However, not all the elements in TL are generated in the 
recursion, only those that are close to the "diagonals". Observe also that 7i has k^ 
elements, whereas the minimal Krylov procedure generates three matrices with total 
number of 3fc^ elements. We now show that the minimal Krylov procedure induces a 
certain partial tensor-Krylov factorization. 

Proposition 4. As.sume that Uk, Vk, and Wk have been generated by the minimal Krylov 
recursion and that % — A- (Uk, Vk, Wk) ■ Then, for 1 < i < k — 1, 

{A-{Vk,Wk)^^:,){:,i,i) = {{Uk)^-U){:,i,i) = UkH^{:,i), (19) 

{A • {Uk, W^fc)i,3)(* + 1: -^ *) = ((^fc)2 • ^)(* + 1: :: *) = VkH-{:, i), (20) 

{A ■ (Uk, T4)i^2)(« + 1,* + 1, = {{Wk)^ ■ mil + 1,1 + 1,:) = WkH^{:,i). (21) 

Proof. Let 1 < i < fc — 1 and consider the fiber 

'H{:,i,i) = [h2ii /i2m ••• ft.i+i,rj ^i+2,M ••• hkii] 
Since, from the minimal recursion, 

i+l 



A ■ (vi, Wi)^.^ = ^ hxiiUx = Ui+iH^{:, i), 
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we have, for i + 2 < s < k, 

hsii = A- {Us,Vi,Wi) = (wj)^ • [A- {v.i,Wi)^.^) = 0. 

Thus hij^2,ii — ■ ■ ■ — hkii — 0. Therefore, the fiber in the left hand side of (|19p is equiva- 
lent to the minimal recursion for computing u^+i. The rest of the proof is analogous. D 

If the sequence of vectors is generated according to Equations (fT5]) - ([T7|) . then a similar 
(and simpler) proposition will hold. For example we would have 

{A • (C/fe, l^fe)i^3)(z, :, = (14)2 • nil, :, i) - VkH^'i:, i), i = 1, . . . , fc. 

4.. 2. A Maximal Krylov Recursion 

Note that when a new u^+i is generated in the minimal Krylov procedure, then we 
use the most recently computed Vi and wi. In fact, we might choose any combination 
of previously computed {wi, V2, ■ ■ ■ ,Vi\ and {wi, . . . , w^} that have not been used before 
to generate a m- vector. Let j < i and k < i, and consider the computation of a new 
M- vector, which we may write 

hu = Uj{A- {Vj,Wk)2^-i) 
KjkUi+l = A • {Vj,Wk)2 3 ~ ^^^^ 

Thus if we are prepared to use all previously computed v- and w-vectors, then we 
have a much richer combinatorial structure, which we illustrate in the following diagram. 
Assume that ui and vi are given. In the first steps of the maximal Krylov procedure the 
following vectors can be generated by combining previous vectors. 



{ui} X {vi} 

{vi} X {wi} 

{ui,U2} X {wi} 

{ui,M2} X {W1,W2,W3} 

{wi,W2,W3} X {wi,W2,.--,We} 

{U1,W2,...,W19} X {wi,W2,--.,W(i} 



U2 

{V2 V3} 

{{wi),W2,W3,W4:,W5,We} 

{{u2),U3,...,Uig} 

{{v2),{v3),VA,-.-,Vii5} 



Vectors computed at a previous step are within parentheses. Of course, we can only 
generate new orthogonal vectors as long as the total number of vectors is smaller than 
the dimension of that mode. Further, if at a certain stage in the procedure we have 
generated a and /3 vectors in two modes, then we can generate altogether 7 — a/3 vectors 
in the third mode (where we do not count the starting vector in that mode, if there was 
one). 

We will now describe the first three steps in some detail. Two starting vectors ui and 
Vi, in the first and second mode, respectively. We also assume that ||ui|| = ||wi|| = 1. 
The normalization and orthogonalization coefficients will be stored in a tensor Ti. Its 
entries are denoted with hijk = 'H{i,j, k). Also when subscripts are written on tensor H, 
they will indicate the dimensions of the tensor, e.g. '^211 is a 2 x 1 x 1 tensor. 
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Step (1). In the first step we generate an new third mode vector by computing 

A- (wi,wi)i2 = ^111^1 = (■^1)3 -^111, (22) 

where hm — Jim is a normahzation constant. 
Step (2). Here we compute a new first mode vector; 

U2^ A- {vi.wi)^^ . 
The orthogonahzation coefficient satisfies 

ulu2 = u1{A- (wi,wi)2 3) = A- {ui,vi,wi) = ■wl{A- (wi,wi)i2) = ^^111' (23) 

from (j22p . After orthogonalization and normalization, 

/1211M2 = "2 - ft-iiiui, (24) 

and rearranging the terms in (|24p . we have the following tensor-Krylov factorization 

A ■ (Wi, Wi)2,3 ^ (["1' "zDi • H21I, 7^211 = 

Step (3). In the third step we obtain two second mode vectors. To get V2 we compute 

V2^ A- (Ui, Wi)i 3 , /ll2lt'2 = W2 - /iiiiwi; 

the orthogonalization coefficient becomes hm using an argument analogous to that in 

(ESI). 

Combining U2 with wi will yield ^3 as follows; first we compute 

V'i= A- (u2,Wl)i^3, 

and orthogonalize 

vivz = A- {u2,Vi,Wi) = ul{A- (^1,^1)2^3) = ulu2 = /l211- 

We see from ((24|) that /i2ii is already computed. The second orthogonalization becomes 

V2V3, = A- (u2,V2,Wl) =: ft-221- 

Then 

h2'ilV'i =V'i- /l21ll'l - ^22lW2 

After a completed third step we have a new tensor-Krylov factorization 



^2xmxl 3 _^ . (j^^ U2], U;i)i 3 = ([i;i V2 f3])2 ' ^231, ^231 

Note that the orthogonalization coefficients are given by 

h\^^y = A- iux.,v^,w^) . 
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Algorithm 5 Maximal Krylov recursion 



ui, vi given starting vectors of length one 

hiiiwi = A- (ui,wi)i 2 

a = (3 = ^ = 1, Ua = ui, Vp = vi and W^ = wi 
while a < amax and /3 < /3max and 7 < 7max do 
% U-/00P % 

Ua = [Ml^...,Ua], f7 = [j, Va = [ui,...,U/j], W^ = [wi,...,w^], i = 1 

for all (/3, 7) such that (3 < /3 and 7 < 7 do 
if the pair (/3, 7) has not been used before then 

ha = %{! : a,/3,7) 
hi^ A- {U,vp,w^) 

nia + l:a + ij,^) = [hj /i„+, j^]T 

U = [C/Ua+i], i = i + 1 
end if 
end for 
C/0^+i = [[/„[/], a = ;37+l 

% v-loop % 

Ua = [ui,...,Ua], Vp = [wi,...,w^], y = [], W^^ = [wi,...,u;^], j = 1 
for all (a, 7) such that a < a and 7 < 7 do 
if the pair (a, 7) has not been used before then 

hp=n{aA:p,^) 

hj = A- {Ua, V, W^) 

ha,i3+j,^vp+j = A • (uq, w^)i 3 - Vphp - Vhj 

V=[Vvp+,],j=j + l 
end if 
end for 

Vaj+i = [VpV],l3 = aj + l 

% w-loop % 

Ua = [Ul,.._. ,Uq], Vp = [vi,...,Vj3],Wj = [wi,...,w^], W ^[], k =1 

for all (a, /3) such that a < a and (3 < /3 do 
if the pair (a, /3) has not been used before then 

h^ — H{a,l3, 1 : 7) 

hk^ A- {ua,Vp,W) 

^a/3,7+feW7+fc = -4' (""'"^^g)! 2 ~ W^7^7 " ^^^fc 

?^(aj,7 + l:7 + fc) = [^l '/ia/3,7+fc]^ 
VF= [M^w^+fc], k = k + l 

end if 

end for 

Wap = [W^Wlj = a(3 
end while 



This maximal procedure is presented in Algorithm [5] The algorithm has three main 
loops, and it is maximal in the sense that in each such loop we generate as many new 
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vectors as can be done, before proceeding to the next main loop. Consider the w-loop (the 
other loops are analogous). The vector ha is a mode-l vectouand contains orthogonaliza- 
tion coefHcients with respect to u-vectors computed at previous steps. These coefficients 
are values of the tensor Ti. The vector hi on the other hand contains orthogonalization 
coefficients with respect to m- vectors that are computed within the current step. Its di- 
mension is equal to the current number of vectors in U. The coefficients hi together with 
the normalization constant h^j^i 9 7 ^^ ^^^ newly generated vector Ua+i are appended at 
the appropriate positions of the tensor Ti. Specifically the coefficients for the u- vector ob- 
tained using Vfj and w^ are stored as first mode fiber, i.e. T-L{:, /3, 7) = [hj^ hj h^^.^jj.^]^ . 
Since the number of vectors in U are increasing for every new m- vector the dimension of 
[/ij hJ ft-Q+i j^^]^ and thus the dimension of TL along the first mode increases by one as 
well. The other mode-l fibers are filled out with a zero at the bottom. Continuing with 
the i;-loop, the dimension of the coefficient tensor Ti. increases in the second mode. 

It is clear that TL has a zero-nonzero structure that resembles that of a Hessenberg 
matrix. If we break the recursion after any complete outer for all-statement, we can 
form a tensor-Krylov factorization. 

Theorem 5 (Tensor Krylov factorizations). Let a tensor A G R'x™x" and two starting 
vectors ui and vi be given. Assume that we have generated matrices with orthonormal 
columns using the maximal Krylov procedure of Algorithm [5| , and a tensor % 0/ or- 
thonormalization coefficients. Assume that after a complete u-loop the matrices Ua, Vp, 
and W^, and the tensor TLapj G R"X;3x7^ have been generated, where a < I, P < m, and 
7 < n. Then 

A ■ iVp,W^)^^^ = (Ua), ■ Ua0^. (25) 

Further, assume that after the following complete v-loop we have orthonormal matrices 
Ua, Vp, Wy, and the tensor n^p-y G M"^^^'^ where ^ = a'j + l> p. Then 

A-{Ua,W,),,^{V^)^-nap^. (26) 

Similarly, after the following complete w-loop, we will have orthonormal matrices Ua, 
Vp, W^ and the tensor U^M e K"'"^'''^ where ^ = ai3> -f. Then 

A-{Ua,Vp)^^ = {W^),-Hap^. (27) 

It also holds that Hapi = ^0^7(1 : a, 1 : /3, 1 : 7) and H^p^ — Hap^fi^ ■ o^A ■ ^A ■ l), 
i.e. all orthonormalization coefficients from the u-, v- and w-loops are stored in a single 
and common tensor %. 

Proof. We prove that (|25p holds; the other two equations are analogous. Using the 
definition of matrix-tensor multiplication we see that A-{Vp, W^)^ ^ is a tensor in M'^^^'', 
where the first mode fiber at position (j, k) with j < /3 and /c < 7 is given by u\ = 
A • (vj, Wk)2 3 with X = [j — 1)7 + fc + 1. 



^We here refer to the identification ^. 
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On the right hand side the corresponding first mode fiber T-L{:,i,k) is equal to 

A- (ui,Vj,Wk) 
A- (u2,Vj,Wk) 



hijk 
h2jk 




hx-ijk 

hxjk 







A 



(ux-i,Vj,Wk) 
hxjk 




Thus we have 



ux 



A- (v.j,Wk)2^3, ^ y^^^ijkUj, 



j=l 



which is the equation for computing ux in the algorithm. 



D 



Let Uj and Vk be two matrices with orthonormal columns that have been generated 
by any tensor Krylov method (i.e., not necessarily a maximal one) with tensor A. Assume 
that we then generate a sequence of to = jk vectors {wi,W2, ■ ■ ■ , Wm) as in the w-loop of 
the maximal method. From the proof of Theorem[5]we see that we have a tensor-Krylov 
factorization of the type ([27]), 



A-iUj,Vk)^2^{W„,)s-H 



jkm ■ 



(28) 



It is clear that the dimensions of the U, V and W in the maximal Krylov recursion be- 
come very large even after only 6~7 steps of the procedure. It is not clear how preserving 
a tensor-Krylov factorization can be utilized in practical implementation applications. 
For the matrix case the theory of Krylov factorizations is very important in enabling 
efficient implementation for various algorithms. However, the maximal Krylov recursion 
suggests a way for an efficient algorithmic implementation. Consider the multilinear 
approximation problem of an / x tti x n tensor A 



min \\A- 
u,v,w,S 



iU,V,W)-S\\ 



which is equivalent to 



max \\A ■ (U, V, W) 11, U^U = /, V^V = I, W^W = / 

UVW V > ' / III 



and U, V, W are I x p, ni x q and n x r orthonormal matrices, respectively. Assume that 
we have generated Vp = [vi ... vp] and Wj = [wi . . . w^] using the maximal Krylov 
recurrence but the number of combinations of v- and w- vectors exceeds the number of 
M-vectors that are desired, i.e. 13"/ > p. A natural thing to do in this case is to compute 
the product Up^ — A ■ (Vg, W^)^ 3 and compute the p dimensional dominant subspace of 

its mode one matricization Ui . Similarly for the other modes. With this modification 
we no longer have a tensor-Krylov factorization, however we can manage the blow up in 
the size of the dimensions for U, V, W and obtain efficient algorithms. Although natural, 
this approach may still be impractical. For example if I = 10'', and /? = 7 = 100, then 
Ug^ will be a large and dense 10^ x 10"* matrix. If we are interested in an approximation 
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with p = 100 (rank of the first mode in the approximation) an alternative to compute the 
dominant 100 dimensional subspace of Ui^ would be to take dominant (in some sense) 
Vio and Wio subspaces of Vg and W^, respectively, and compute Wioo = A- (ViO; Wj 



10, 



2,3' 



Then C/ioo is obtained from the columns of Wioo- 



4-3. Optimized Minimal Krylov Recursion 

In some applications it may be a disadvantage that the maximal Krylov method 
generates so many vectors in each mode. In addition, when applied as described in 
Section W?^ it generates different numbers of vectors in the different modes. Therefore it 
is natural to ask whether one can modify the minimal Krylov recursion so that it uses 
"optimal" vectors in two modes for the generation of a vector in the third mode. Such 



procedures have recently been suggested in llj . We will describe this approach in terms 



of the recursion of a vector in the mode 3. The corresponding computations in modes 1 
and 2 are analogous. 

Assume that we have computed i vectors in the first two modes, for instance, and 
that we are about to compute Wi. Further, assume that we will use linear combinations 
of the vectors from modes 1 and 2, i.e. we compute 



A-{u,e,v,7^)^ 



2 ' 



where 9, rj G K* are yet to be specified. We want the new vector to enlarge the 'W 
subspace as much as possible. This is the same as requiring that Wi be as large (in norm) 
as possible under the constraint that it is orthogonal to the previous mode-3 vectors. 
Thus we want to solve 

max||u;||, where w = A- {Ui9,Vi'q)^2^ (29) 



^,r) 



w^W,.u 11^11 = hll=l, 0,7^ € 



The solution of this problem is obtained by computing the best rank-(l, 1,1) approxima- 
tion (6*, ?7, cli) • S of the tensor 

C^ = A.{U,,V,,I-W,.iWj_^). (30) 

A suboptimal solution can be obtained from the HOSVD of Cw 

Recall the assumption that A G K' ><"*><" is large and sparse. Clearly the optimization 
approach has the drawback that the tensor Cw is generally a dense tensor of dimension 
i xi xn, and the computation of the best rank-(l, 1, 1) approximation or the HOSVD of 
that tensor can be quite time-consuming. Of course, in an application, where it is essential 
to have a good approximation of the tensor with as small dimensions of the subspaces as 
possible, it may be worth the extra computation needed for the optimization. However, 
we can avoid handling large, dense tensors by modifying the optimized recursion, so that 
an approximation of the solution of the maximization problem (|29p is computed using t 
steps of the minimal Krylov recursion on the tensor Cw, for small t. 

Assume that we have computed a rank-(i, t, t) approximation of C^,, 

Cw ~ {Q,H,il) • Sw, 
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for some small value of t, using the minimal Krylov method. By computing the best 
rank-(l,l,l) (or HOSVD) approximation of the small tensor Sw G R*^*^*, we obtain 
an approximation of the solution of (I^Hl) . It remains to demonstrate that we can apply 
the minimal Krylov recursion to C^ without forming that tensor explicitly. Consider the 
computation of a vector w in the third mode, given the vectors 6, and tj: 

:C„ • (0,77)12 = iA-{U,,V,,I~W,.iWl,)) ■ (0,r,),_2 (31) 

= (^ • iU,0, V,r^),,) ■ {I - W,Wj).^ = {I- W,WJ)Cj. 

Note that the last matrix-vector multiplication is equivalent to the Gram-Schmidt or- 
thogonalization in the minimal Krylov algorithm. Thus, we have only a sparse tensor- 
vector-vector operation, and a few matrix-vector multiplications, and similarly for the 
computation of 6 and rj. 

It is crucial for the performance of this outer-inner Krylov procedure that a good 
enough approximation of the solution of (P^)) is obtained for small t, e.g. t equal to 2 
or 3. We will see in our numerical examples that it gives almost as good results as the 
implementation of the full optimization procedure. 

44. "Small" Mode 

In information science applications it often happens that one of the tensor modes has 
much smaller dimension than the others. For illustration assume that the first mode is 
small, i.e. I <C min(r7z,n). Then in the Krylov variants described so far, after I steps the 
algorithm has produced a full basis in that mode, and no more need be generated. Then 
the question arises which m- vector to choose, when new basis vectors are generated in 
the other modes. Two obvious alternatives are to use the vectors ui, . . . ,ui in a cyclical 
way, or to take a random linear combination. One may also apply the optimization idea 
in that mode, i.e. in the computation of Wi perform the maximization 

max||{y||, where w = A- {Ui9,v,)^ ^ , w±Wi^i, ||6i|| = 1, 9 eW. 

' 

The problem can be solved by computing a best rank-1 approximation of the matrix 

C^ ^ A- {U,,v,,I -W,.iWj_^) . 

As before, this is generally a dense matrix with one large mode. A rank one approxima- 
tion can again be computed, without forming the dense matrix explicitly, using a Krylov 
method (here the Arnold! method). 

4.. 5. Krylov Subspaces for Contracted Tensor Products 

Recall from Section l3^ that the Golub-Kahan bidiagonalization procedure generated 
matrices Uk,Vk, which are orthonormal basis vectors for the Krylov subspaces of AA^ 
and v4^A, respectively. In tensor notation those products may be written as 

{A,A)_^ = AA^, {A, A) _^ ^ A'^ A. 
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For a third order tensor A € R'x'^x"^ and starting vectors u G R',w G R™,ii; G M" we 
may consider the matrix Krylov subspaces 

ICpiiA, A)_, , u), {A, A)_, = ^(i)(^(i))T G M'^', 

ICg{{A, A)_2 , v), {A, A)_^ = A(2)(A(2))T e M"^", 

JCri{A, ^)„3 , W), {A, A)_3 = A(3) (yl(3))T £ M"^". 

The expressions to the right in each equation are matricized tensors. It suffices for our 
discussion to know that for an Z x to x n tensor A one can associate three matrices 
^(1) g T^ixmn^ ^(2) ^ j^mx/n ^^j ^(3) (= ]gnx/m p^j. details the interested reader may 

consider [J, |7|, |9|, |3l[ . In this case we reduce a third order tensor to three different (sym- 
metric) matrices, for which we compute the usual matrix subspaces through the Lanczos 
recurrence. This can be done without exphcitly computing the products (.4, ,4) ,-, thus 
taking advantage of sparsity. To illustrate this consider the matrix times vector operation 
^(i)(^(i))Ty^ which can be written 

n 

[Ai . . . A„] [Ai . . . A„]Tu ^ J2 ^«^7"' (32) 

where Ai = A{:, :, i) is the i'th frontal slice of A. 

The result of the Lanczos process separately on the three contracted tensor products 
is three sets of orthonormal basis vectors for each of the modes of the tensor, collected 
in Up,Vq,Wr, say. A low-rank approximation of the tensor can then be obtained using 
Lemma [TJ 

It is straightforward to show that if ^ = {X, Y, Z) ■ C with rank(^) = {p, q, r), then 
the contracted tensor products 

{A,A)_, - A'-^^A^^^)'^ = XC^^\y (g, ZY{Y (g, Z)(C(i))Tx"r, (33) 

{A,A)_2 = A(2)(A(2))T = YC^^\X (g, ZY{X ® Z)(C(2))TrT^ (34) 

{AA)_^ = yl(3)(A(3))T ^ ZC(3)(X ® Y){X (g, r)(C(^))TzT, (35) 

are matrices with ranks p, q and r, respectively. Then it is clear that the separate Lanczos 
recurrences will generate matrices C/, V, W that span the same subspaces as X, Y, Z in p, 
q and r iterations, respectively. 

Remark. Computing p (or q or r) dominant eigenvectors of the symmetric positive 
semidefinite matrices {A, A) _j^ , {A, A) _2, {A,A)_^, respectively, is equivalent to com- 
puting the truncated HOSVD of A. We will show the calculations for the first mode. 
Using the HOSVD A = {U, V, W)- S, where now U, V, and W are orthogonal matrices 
and the core S is all-orthogonal JTl, we have 

{A,A)_^ = US'^^'>{V ® W^iV g) W){S^^'>yu'^ = USU^, 

where S = S'(-^'(S'(^^)^ = diag(cri, (t|, . . . ,cr;^) with af > af^^ and ai are first mode 
multilinear singular values of A. 
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4-6. Complexity 

In this subsection we will discuss the amount of computations associated to the dif- 
ferent methods. Assuming that the tensor is large and sparse it is likely that, for small 
values of k (compared to I, m, and n), the dominating work in computing a rank-(fc, fc, k) 
approximation is due to tensor- vector- vector multiplications. 

Minimal Krylov recursion. Considering Equation (jlSp . it is clear that computing the 
k X k X k core tensor T-i is necessary to have a low rank approximation of A. From 
the proof of Proposition 2] we see that T-L has a certain Hessenberg structure along and 
close to its "two-mode diagonals". However, away from the "diagonals" there will be 
no systematic structure. We can estimate that the total number of tensor-vector-vector 
multiplications for computing the k x k x k tensor H is k'^. The computation of H can 
be split as 

n^A-iUk,Vk,Wk)=Auv-{Wk)3, where A. = ^ • (C/fe, ^4)1,3 ■ 

There are k^ tensor- vector- vector products for computing the k x k x n tensor Auv ■ The 
complexity of the following computation Auv • (W^Oa i^ 0{k^n), i.e. about k^ vector- 
vector inner products. 

Several of the elements of the core tensor are available from the generation of the 
Krylov vectors. Naturally they should be saved to avoid unnecessary work. Therefore 
we need not include the 3A; tensor-vector-vector multiplications from the recursion in the 
complexity. 

Maximal Krylov Recursion. There are several options in the use of the maximal recur- 
sion for computing a rank-(A;,fc, fc) approximation. One may apply the method until all 
subspaces have dimension larger than k. In view of the combinatorial complexity of the 
method the number of tensor- vector- vector multiplications can then be much higher than 
in the minimal Krylov recursion. Alternatively, as soon as one of the subspaces reaches 
dimension k, one may stop the maximal recursion and generate only the remaining vec- 
tors in the other two modes, so that the final rank becomes {k,k,k). That variant has 
about the same complexity as the minimal Krylov recursion. 

Optimized Krylov Recursion. The optimized recursion can be implemented in different 
ways. In Section 14.31 we described a variant based on "inner Krylov steps" . Assuming 
that we perform t inner Krylov steps, finding the (almost) optimal w (j3ip requires 3i 
tensor-vector-vector multiplications. Since the optimization is done in k outer Krylov 
steps in three modes we perform 9kt such multiplications. The total complexity becomes 
k^ + 9kt. In [ll[ another variant is described where the optimization is done on the core 
tensor. 

"Small" Modes. Assume that the tensor has one small mode and that a random or fixed 
combination of vectors is chosen in this mode when new vectors are generated in the 
other modes. Then the complexity becomes fc^ -I- 2k. 
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Krylov Subspaces for Contracted Tensor Products. In each step of the Krylov method 
a vector is muhiphed by a contracted tensor product. This can be implemented using 
P2p . If we assume that each such operation has the same complexity as two tensor- 
vector-vector multiplications, then the complexity becomes k^ + 6k, where the second 
order term is for computing the core tensor. 

The complexities for four of the methods are summarized in Table \T\ 



Method 


Complexity 


Minimal Krylov 


k-' 


Optimized minimal Krylov 


P + 9kt 


"Small" mode 


e + 2k 


Contracted tensor products 


fc2+6fc 



Table 1: Computational complexity (tensor- vector- vector multiplications) for the computation of a rank- 
(k, k, k) approximation writh different methods. In the optimized Krylov recursion t inner Krylov steps 
are made. 



5. Numerical Examples 

The purpose of the examples in this section is to make a preliminary investigation 
of the usefulness of the concepts proposed. We will generate matrices U, V, W using the 
various Krylov procedures and, in some examples for comparison, the truncated HOSVD. 
Given a tensor A and matrices U, V, W the approximating tensor A has the form 

A = (t/, V, W) ■ C, where C^A-{U,V,W). (36) 

Of course, for large problems computing A explicitly (by multiplying together the ma- 
trices and the core C) will not be feasible, since that tensor will be dense. However, it is 
easy to show that approximation error is 

\\A-Ar = \\Ar-\\cr. 

For many a ppl ications a low rank approximation is only an intermediate or auxiliary 



result, see e.g. 25|. It sometimes holds that the better the approximation (in norm), the 
better it will perform in the particular application. But quite often, especially in infor- 
mation science applications, good performance is obtained using an approximation with 
quite high error, see e.g. J15i |. Our experiments will focus on how good approximations 
are obtained by the proposed methods. How these low rank approximations are used will 
depend on the application as well as on the particular data set. 

For the timing experiments in Sections 15.11 and 15.21 we used a MacBook laptop with 
2.4GHz processor and 4 GB of main memory. For the experiments on the Netflix data 
in Section 15.31 we used a 64 bit Linux machine with 32 GB of main memory running 
Ubuntu. The calculations were performed using Matlab and the TensorToolbox, which 
supports computation with sparse tensors [3|, Ij] . 
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dini(^)\rank(^) 


(10,10,10) 


(10,15,20) 


(20,30,40) 


Method 


(1) (2) 


(1) (2) 


(1) (2) 


50 X 70 X 60 
100 X 100 X 100 
150 X 180 X 130 


0.087 0.165 
0.38 2.70 
1.32 11.27 


0.113 0.162 
0.44 2.72 
1.44 11.07 


0.226 0.169 
0.91 2.71 
3.01 11.01 



Table 2: Timing in seconds for computing the HOSVD of low rank tensors using (1) the modified minimal 
Krylov method and HOSVD of the smaller core H and (2) truncated HOSVD approximation of A. 

5.1. Minimal Krylov Procedures 

We first made a set of experiments to confirme that the result in Theorem [3] holds for 
a numerical implementation, using synthetic data generated with a specified low rank. 

It is not uncommon that tensors originating from signal processing applications have 
low multilinear ranks. Computing the HOSVD of such a tensor A can be done by direct 
application of the SVD on the different matricizations A^^^^ for i = 1,2, 3. An alternative 
is to first compute Up,Vq,Wr using the modified minimal Krylov procedure. Then we 
have the decomposition A = {Un,Vq, Wr) • "H. To obtain the HOSVD of A we compute 
the HOSVD of the much smalleCl tensor TL = [U, V, W) • C. It follows that the singular 
matrices for A are given by UpU, VqV and WrW. We conducted a few experiments to 
compare timings for the two approaches. Tensors with three different dimensions were 
generated and for each case we used three different low ranks. The tensor dimensions, 
their ranks and the computational times for the respective case are presented in Table 
m We see that for the larger problems the computational time for the HOSVD is 2-8 
times longer than for the modified minimal Krylov procedure with HOSVD on the core 
tensor. Of course, timings of Matlab codes are unreliable in general, since the efficiency 
of execution depends on how much of the algorithm is user-coded and how much is 
implemented in Matlab low-level functions (e.g. LAPACK- based). It should be noted 
that the tensors in this experiment are dense, and much of the HOSVD computations are 
done in low-level functions. Therefore, we believe that the timings are rather realistic. 

Next we let A G ]]j50x60x4o ^^ g^ random tensor, and computed a rank-(10, 10, 10) 
approximation using the minimal Krylov recursion and a different approximation using 
truncated HOSVD. Let the optimal cores computed using Lemma [1] be denoted T^min 
and Hhosvd, respectively. We made this calculation for 100 different random tensors and 
report (||Hinin|| — |!'Hhosvd||)/||'Hhosvd|| for each case. Figure [T] illustrates the outcome. 
Clearly, if the relative difference is larger than 0, then the Krylov method gives a better 
approximation. In about 80% of the runs the minimal Krylov method generated better 
approximations than the truncated HOSVD, but the difference was quite small. 

In the following experiment we compared the performance of different variants of the 
optimized minimal Krylov recursion applied to sparse tensors. We generated tensors 
based on Facebook graphs for different US universities l29| . The Caltech graph is repre- 
sented by a 597 x 597 sparse matrix. For each individual there is housing information. 
Using this we generated a tensor of dimension 597 x 597 x 64, with 25646 nonzeros. The 



'A is a, I xmxn tensor and H is a pxqxr, and usually the multilinear ranks p, q, r are much smaller 
than the dimensions I, in, n of A. 
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Figure 1: Difference between ||ytmin||i approximation obtained with the minimal Krylov method, and 
ll'^hosvdlli approximation obtained by the truncated HOSVD of a 50 X 60 X 40 tensor J^. The ranlc of 
the approximations were (10, 10, 10). 

purpose was to see how good approximations the different methods gave as a function 
of the subspace dimension. We compared the minimal Krylov recursion to the following 
optimized variants: 

Opt-HOSVD. The minimal Krylov recursion with optimization based on HOSVD of 
the core tensor (pO)) . This variant is very costly and is included only as a benchmark. 

Opt-Krylov. The minimal Krylov recursion that utilized three inner Krylov steps to 
obtain approximations to the optimized linear combinations. This is an implemen- 
tation of the discussion from the second part of Section 14.31 

Opt-Alg8. Algorithm 8 in (11,0 

Truncated HOSVD. This was included as a benchmark comparison. 

minK-back. In this method we used the minimal Krylov method but performed 10 
extra steps. Then we formed the core Ti and computed a truncated HOSVD ap- 
proximation oi T-i. As a last step we truncated the Krylov subspaces accordingly. 

In all Krylov-based methods we used four initial minimal Krylov steps before we 
started using the optimizations. 



^The algorithm involves the approximation of the dominant singular vectors of a matrix computed 
from the core tensor. In lllll the power method was used for this computation. We used a linear 
combination of the first three singular vectors of the matrix, weighted by the singular values. 
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Figure 2: Errors in the low rank approximations of the sparse Caltech (top) and Princeton (bottom) 
tensors. 
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Another sparse tensor was created using the Facebook data from Princeton. Here 
the tensor was constructed using a student /faculty flag as third mode, giving a 6593 x 
6593 X 29 tensor with 585,754 non-zeros. 

The results are illustrated in Figure [51 We see that for the Caltech tensor the 
"backward-looking" variant (minK-back) gives good approximations for small dimen- 
sions as long as there is a significant improvement in each step of the minimal Krylov 
recursion. After some ten steps all optimized variants give approximations that are rather 
close to that of the HOSVD. 

For the Princeton tensor we only ran the minimal Krylov recursion and two of the 
optimizations. Here the optimized versions continued to give significant improvements 
as the dimension is increased, in spite of the poor performance of the minimal Krylov 
procedure itself. 

5.2. Test on Handwritten Digits 



Tensor methods for the classification of handwritten digits are described in [24 



25|. We have performed tests using the handwritten digits from the US postal ser- 
vice database. Digits from the database were formed into a tensor V of dimensions 
400 X 1194 X 10. The first mode of the tensor represents pixelf^jj the second mode 
represents the variation within the different classes and the third mode represents the 
different classes. Our goal was to find low dimensional subspaces Up and Vq in the first 
and second mode, respectively. The approximation of the original tensor can be written 
as 

jg400xll94xlO 3 p ^ (^^^ y^)^^^ . J, = ^^^^^^ (37) 

An important difference compared to the previous sections is that here we wanted to 
find only two of three matrices. The class mode of the tensor was not reduced to 
lower rank, i.e. we were computing a rank-(p, g, 10) approximation of P. We com- 
puted low rank approximations for this tensor using five different methods: (1) trun- 
cated HOSVD; (2) modified minimal Krylov recursion; (3) contracted tensor product 
Krylov recursion; (4) maximal Krylov recursion; and (5) optimized minimal Krylov re- 
cursion. Figure [31 shows the obtained results for low rank approximations with {p, q) = 
{(5,10), (10,20), (15,30), ••• , (50,100)}. The reduction of dimensionality in two 
modes required special treatment for several of the methods. We will describe each case 
separately. In each case the low rank approximation Pp^g^io is given by 

Vp^,^,, = {UpUj,V,Vj)^^-V (38) 

where the matrices Up and Vq were obtained using different methods. 

Truncated HOSVD. We computed the HOSVD V ^ ([/, V,W) • C and truncated the 
multilinear singular matrices, i.e. Up = [/(:, 1 : p) and Vq = V{:, 1 : q). 

Modified minimal Krylov recursion. The minimal Krylov recursion was modified in sev- 
eral respects. We ran Algorithm [3| for 10 iterations and obtained Uio, Vio, Wiq. Next we 
ran p — 10 iterations and generated only u and v vectors. For every new u^+i we used 
V and u) as random liner combination of vectors in Vk and Wio , respectively. In the last 
q—p iterations we only generated new v vectors using again random linear combinations 
of vectors in Up and Wio . 



^"Each digit is smoothed and reshaped to a vector. 
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^^ — Truncated HOSVD 
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First mode rank p, sedond mode rank q = 2*p 



Figure 3: The relative error of the low rank approximations obtained using five different methods. The 
X-axis indicates the ranks (p, q, 10) = (p, 2p, 10) in the approximation. 



Contracted tensor product Krylov recursion. For this method we apphed p and q steps of 
the Lanczos method with random starting vectors on the matrices {A, A) _-^ and {A, A) _2, 
respectively. 

Maximal Krylov recursion. We used the maximal Krylov recursion with starting vectors 
ui and vi to generate Wi ^- U2 ^ V3 ^ Wq — >■ C/19 -> V115. Clearly, we now need 
to make modification to the general procedure. Wio was obtained as the third mode 
singular matrix from the HOSVD of the product V • (C/19, ^115)1 2 ^^'^ ^^ ^ '^^^^ ^^^P ^^ 
computed C/100 as the 100 dimensional dominant subspace obtained from the first mode 
singular matrix of V • (V114, 1^10)2 3- "^^^ Up and Vq used for the approximation in (j38p 
were constructed as follows: We formed the product C = P- (t/100, Viis)^ 2 ^^^ computed 
the HOSVD (u, V,w] -€ = 0. Then Up = UiooU{:, 1 : p) and Vg = C/ii5^(:, 1 : q)- 



Optimized minimal Krylov recursion. For this minimal Krylov recursion we used the 
optimization approach, for every new vector, from the first part of Section 14.31 The 
coefficients for the linear combinations were obtained by best rank-(l, 1,1) approximation 
of factors as C^ in Equation ([50]) . 

The experiments with the handwritten digits are illustrated in Figure [31 We made 
the following observations: (1) The truncated HOSVD give the best approximation for 
all cases; (2) The minimal Krylov approach did not perform well. We observed several 
breakdowns for this methods as the ranks in the approximation increased. Every time 
the process broke down we used a randomly generated vector that was orthonormalized 

26 



against all previously generated vectors. (3) The Lanczos method on the contracted 
tensor products performed very similar as the optimized minimal Krylov method. (4) 
The performance of the maximal Krylov method was initially as good as the truncated 
HOSVD but its performance degraded eventually. 

Classifications results using these subspaces in the algorithmic setting presented in 
[25| are similar for all methods, indicating that all of the methods capture subspaces that 
can used for classification purposes. 

5.3. Tests on the Netflix Data 

A few years ago, the Netflix company announced a competition^^! to improve their 
algorithm for movie recommendations. Netflix made available movie ratings from 480,189 
uscrs/costumcrs on 17,770 movies during a time period of 2243 days. In total there were 
over 100 million ratings. We will not address the Netflix problem, but we will use the 
data to test some of the Krylov methods we are proposing. For our experiments we 
formed the tensor A that is 480, 189 x 17, 770 x 2243 and contains all the movie ratings. 
Entries in the tensor for which we do not have any rating were considered as zeros. We 
used the minimal Krylov recursion and the Lanczos process on the products {A,A)_^, 
(■4, ^)_2 &iid {A,A)_^ to obtain low rank approximations of A. 

In Figure m (left plot) we present the norm of the approximation, i.e. ||^min|| = 
||Cinin||, where Cmin = A • {Uk,Vk,Wk). We have the same low rank approximation in 
each mode and the ranks range from k = 5, 10, 15, ... , 100. The plot contains three 
different runs with random initial vectors in all three modes and a fourth curve that 
is initiated with the means of the first, second and third mode fibers. Observe that 
for this size of tensors it is practically impossible to form the approximation ^min — 
[UkUj ,VkV^ ,WkWj) • A since the approximation will be dense. But the quantity 
1 1 Cmin 1 1 is computable and indicates the quality of the approximation. Larger ||Cmin|| 
means better approximation. In fact for orthonormal matrices U, V, W it holds that 

\\C^in\\<\\A\\.^ 

Figure S] (right plot) contains similar plots, but now the approximating matrices 
C/fe, Vfe, Wk are obtained using the Lanczos process on the symmetric matrices {A, A)_i, 
{A,A)_2 and (^, ^)_3. We never formed the first two products, but use the computa- 
tional formula from Equation p2p for obtaining first mode vectors and a similar one for 
obtaining second mode vectors. We did form {A,A)_^ explicitly since it is a relatively 
small matrix. We ran the Lanczos process with ranks k — 5, 10, 15, ... , 100 using ran- 
dom starting vectors in all three modes. Three tests were made and we used the Lanczos 
vectors in Uk,Vk, Wk- In addition we computed the top 100 eigenvectors for each one of 
the contracted products. 

We remark that this Netflix tensor is special in the sense that every third mode fibre, 
i.e. A{i,j,:), contains only one nonzero entry. It follows that the product {A,A)_^ is 
a diagonal matrix. Our emphasis for these tests was to show that the proposed tensor 
Krylov methods can be employed on very large and sparse tensors. 



^'^The competition has obtained huge attention from many researcher and non-researchers. The im- 
provement of 10 % that was necessary to claim the prize for the contest was achieved by join efforts of 
a few of the top teams [22(| . 
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Figure 4: We plot ||Cmin|| as a function of the rank {p,q,r) = {j>,p,p) in the approximation with 
p = 5, 10, 15, . . . , 100. Left plot: Up, Vp, Wp are obtained using the minimal Krylov recursion. Four 
runs are presented: one using starting vectors mi,iii,«)i as the means of the mode one, two and three 
fibers of A and three runs with different set of random initial vectors. Right plot: The subspaces for 
his case were obtained from separate Lanczos recurrences of contracted tensor products. The starting 
vectors were chosen as in the left plot. 

In this experiment it turned out to be more efficient to store the sparse Netflix tensor 
sUce-wise, where each sHce itself was a sparse matrix, than using the sparse tensor format 
from the Tensor Toolbox. 



6. Conclusions and Future Work 

In this paper we propose several ways to generalize matrix Krylov methods to ten- 
sors, having applications with sparse tensor in mind. In particular we introduce three 
different methods for tensor approximations. These are the minimal, maximal Krylov 
m,ethods and the contracted tensor product methods. We prove that, given a tensor of 
the form A — {X,Y,Z) • C with rank(^) = rank(C) = {p,q,r), a modified version of 
the minimal Krylov recursion extracts the associated subspaces of A in max{p, q, r} it- 
erations. We also investigate a variant of the the optimized minimal Krylov recursion 
ll( , which gives better approximation than the minimal recursion, and which can be im- 
plemented using only sparse tensor operations. We also show that the maximal Krylov 
approach generalizes the matrix Krylov factorization to a corresponding tensor Krylov 
factorization. 

The experiments clearly indicate that the Krylov methods are useful for low-rank 
approximation of large and sparse tensors. In 11| it is also shown that they are efficient 
for further compression of dense tensors that are given in canonical format. The tensor 
Krylov methods can also be used to speed up HOSVD computations. 

As the research on tensor Krylov methods is still in a very early stage, there are 
numerous questions that need to be answered, and which will be the subject of our 
continued research. We have hinted to some in the text: here we list a few others. 



1. Details with respect to detecting true break down, in fioating point arithmetic, and 
distinguishing those from the case when a complete subspaces is obtained need to 
be worked out. 
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2. A difficulty with Krylov methods for very large problems is that the basis vectors 
generated are in most cases dense. Also, when the required subspace dimension is 
comparatively large, the cost for (re)orthogonalization will be high. For matrices 
the subspaces can be improved using the implicitly restarted Arnoldi (Krylov- 



Schur) approach 2l|, l28l |. Preliminary tests indicate that similar procedures for 
tensors may be efficient. The properties of such methods and their implementation 
will be studied. 

3. The efficiency of the different variants of Krylov methods in terms of the number of 
tensor-vector-vector operations, and taking into account the convergence rate will 
be investigated. 
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